Web & Data Science — Final Project
Thilo Hohl, 2093510
In this project, I aim to analyse and compare the 2013–2017 Formula 1 seasons and the teams competing using the processes and techniques acquired in this semesters’ lessons.
Upon forming the idea, I quickly found the article “Formula 1 Team Radio: A text-mining use case” by Marcell Ignéczi, which already touched upon most of the things I’d thought of, but only looked at the 2017 season and only sparingly correlated findings with actual events. I am hoping to go into deeper detail and also compare multiple seasons and the teams’ differences between those years.
Ignéczi’s source, radio broadcast transcripts from racefans.net, also proved to be the most complete and valuable collection of their kind, since, even though full team radios are broadcast in F1TV, this service is unavailable in Germany and I was unable to find any recordings.
racefans.net only transcribed the radio messages contained in the Liberty Media Formula One Broadcast, so a selection of highlights, not everything sent between drivers and teams. Still, since this data exists for every Grand Prix in between the 2013 and 2017, I am hopeful to find statistically significant observations.
For context, we’ll also need the official driver standings for each year.
Formula1.com, the competition’s official web page, provides this data.
data_frames = list()
alldrivers = list()
for (year in 2013:2017) {
domain = read_html(paste("https://www.formula1.com/en/results.html/",year,"/drivers.html", sep = ""))
table = html_nodes(domain, "table")
data = html_table(table, fill = TRUE)[[1]]
# Pick the columns we need
data = data[,2:6]
data = select(data, -c(Nationality))
# Remove Tabs, newline characters, trailing spaces and clean up driver abbreviations
data$Driver = str_replace_all(data$Driver, "[\\t \\n]+", " ")
data$Driver = str_replace_all(data$Driver, "[A-Z]{3}", "")
data$Driver = str_trim(data$Driver)
data$Driver = str_replace_all(data$Driver, "Carlos Sainz", "Carlos Sainz Jnr")
data$Driver = str_replace_all(data$Driver, "Kimi Räikkönen", "Kimi Raikkonen")
for (driver in data$Driver) {
if(!(driver %in% alldrivers)) {
alldrivers = c(alldrivers, driver)
}
}
data$Year = year
var_name = paste0("driverstandings", year)
assign(var_name, data, envir = .GlobalEnv)
data_frames[[year-2012]] = get(var_name, envir = .GlobalEnv)
csv_name = paste0("driverstandings", year, ".csv")
write.csv(data_frames[[year-2012]], csv_name, row.names = F)
}
Now, we’ve got a list of drivers and teams as well as the seasons’ results.
Let’s look at 2017’s standings:
driverstandings2013 = read.csv("driverstandings2013.csv")
driverstandings2014 = read.csv("driverstandings2014.csv")
driverstandings2015 = read.csv("driverstandings2015.csv")
driverstandings2016 = read.csv("driverstandings2016.csv")
driverstandings2017 = read.csv("driverstandings2017.csv")
kable(driverstandings2017)
| Pos | Driver | Car | PTS | Year |
|---|---|---|---|---|
| 1 | Lewis Hamilton | Mercedes | 363 | 2017 |
| 2 | Sebastian Vettel | Ferrari | 317 | 2017 |
| 3 | Valtteri Bottas | Mercedes | 305 | 2017 |
| 4 | Kimi Raikkonen | Ferrari | 205 | 2017 |
| 5 | Daniel Ricciardo | Red Bull Racing TAG Heuer | 200 | 2017 |
| 6 | Max Verstappen | Red Bull Racing TAG Heuer | 168 | 2017 |
| 7 | Sergio Perez | Force India Mercedes | 100 | 2017 |
| 8 | Esteban Ocon | Force India Mercedes | 87 | 2017 |
| 9 | Carlos Sainz Jnr | Renault | 54 | 2017 |
| 10 | Nico Hulkenberg | Renault | 43 | 2017 |
| 11 | Felipe Massa | Williams Mercedes | 43 | 2017 |
| 12 | Lance Stroll | Williams Mercedes | 40 | 2017 |
| 13 | Romain Grosjean | Haas Ferrari | 28 | 2017 |
| 14 | Kevin Magnussen | Haas Ferrari | 19 | 2017 |
| 15 | Fernando Alonso | McLaren Honda | 17 | 2017 |
| 16 | Stoffel Vandoorne | McLaren Honda | 13 | 2017 |
| 17 | Jolyon Palmer | Renault | 8 | 2017 |
| 18 | Pascal Wehrlein | Sauber Ferrari | 5 | 2017 |
| 19 | Daniil Kvyat | Toro Rosso | 5 | 2017 |
| 20 | Marcus Ericsson | Sauber Ferrari | 0 | 2017 |
| 21 | Pierre Gasly | Toro Rosso | 0 | 2017 |
| 22 | Antonio Giovinazzi | Sauber Ferrari | 0 | 2017 |
| 23 | Brendon Hartley | Toro Rosso | 0 | 2017 |
The data exists as separate articles on racefans.net for each Grand Prix, containing a few paragraphs of text and then, conveniently, a table containing lap, radio message destination (to or from certain drivers) and Message Text.
Links to our data look like this, so it’s hard to find a methodology to crawl them, since we don’t know the date they were written:
https://www.racefans.net/2016/03/23/2016-australian-grand-prix-team-radio-transcript-2/
https://www.racefans.net/2017/09/08/2017-italian-grand-prix-team-radio-transcript/
Luckily, racefans.net is searchable. We’ll just search for the term
'grand prix team radio transcript' and collect the links to
the articles that way.
domain = "https://www.racefans.net/?s=Grand+Prix+team+radio+transcript"
# first page
remDr$navigate(domain)
Sys.sleep(sleep) # give the page time to fully load
page = remDr$getPageSource()[[1]]
resultpage = read_html(page)
links = resultpage %>%
html_nodes('.entry-title') %>%
html_nodes('a') %>%
rvest::html_attr('href')
# later pages
for (index in 2:5) {
remDr$navigate(paste("https://www.racefans.net/page/",index,"/?s=Grand+Prix+team+radio+transcript", sep=""))
Sys.sleep(sleep)
page = remDr$getPageSource()[[1]]
resultpage = read_html(page)
morelinks = resultpage %>% html_nodes('.entry-title') %>% html_nodes('a') %>% rvest::html_attr('href')
links = append(links, morelinks)
}
Let’s clean up the links we don’t need. The first link and the last few are different articles, and the 2017 chinese grand prix transcript is corrupted.
links = links[2:86]
links = links[-17]
write.csv(links, "links.csv", row.names = F)
links = read.csv("links.csv")
print(links)
## x
## 1 https://www.racefans.net/2017/11/30/2017-abu-dhabi-grand-prix-team-radio-transcript/
## 2 https://www.racefans.net/2017/11/16/2017-brazilian-grand-prix-team-radio-transcript/
## 3 https://www.racefans.net/2017/11/02/2017-mexican-grand-prix-team-radio-transcript/
## 4 https://www.racefans.net/2017/10/26/2017-united-states-grand-prix-team-radio-transcript/
## 5 https://www.racefans.net/2017/10/11/2017-japanese-grand-prix-team-radio-transcript/
## 6 https://www.racefans.net/2017/10/04/2017-malaysian-grand-prix-team-radio-transcript/
## 7 https://www.racefans.net/2017/09/25/2017-singapore-grand-prix-team-radio-transcript/
## 8 https://www.racefans.net/2017/09/08/2017-italian-grand-prix-team-radio-transcript/
## 9 https://www.racefans.net/2017/08/31/2017-belgian-grand-prix-team-radio-transcript/
## 10 https://www.racefans.net/2017/08/04/2017-hungarian-grand-prix-team-radio-transcript/
## 11 https://www.racefans.net/2017/07/19/2017-british-grand-prix-team-radio-transcript/
## 12 https://www.racefans.net/2017/06/30/2017-azerbaijan-grand-prix-team-radio-transcript/
## 13 https://www.racefans.net/2017/06/14/2017-canadian-grand-prix-team-radio-transcript/
## 14 https://www.racefans.net/2017/06/02/2017-monaco-grand-prix-team-radio-transcript/
## 15 https://www.racefans.net/2017/05/05/2017-russian-grand-prix-team-radio-transcript/
## 16 https://www.racefans.net/2017/04/21/2017-bahrain-grand-prix-team-radio-transcript/
## 17 https://www.racefans.net/2017/03/31/2017-australian-grand-prix-team-radio-transcript/
## 18 https://www.racefans.net/2016/11/30/2016-abu-dhabi-grand-prix-team-radio-transcript/
## 19 https://www.racefans.net/2016/11/16/2016-brazilian-grand-prix-team-radio-transcript/
## 20 https://www.racefans.net/2016/11/01/2016-mexican-grand-prix-team-radio-transcript/
## 21 https://www.racefans.net/2016/10/26/2016-united-states-grand-prix-team-radio-transcript/
## 22 https://www.racefans.net/2016/10/17/2016-japanese-grand-prix-team-radio-transcript/
## 23 https://www.racefans.net/2016/10/05/2016-malaysian-grand-prix-team-radio-transcript/
## 24 https://www.racefans.net/2016/09/21/2016-singapore-grand-prix-team-radio-transcript/
## 25 https://www.racefans.net/2016/09/13/2016-italian-grand-prix-team-radio-transcript/
## 26 https://www.racefans.net/2016/09/13/2016-belgian-grand-prix-team-radio-transcript/
## 27 https://www.racefans.net/2016/08/03/2016-german-grand-prix-team-radio-transcript/
## 28 https://www.racefans.net/2016/08/03/2016-hungarian-grand-prix-team-radio-transcript/
## 29 https://www.racefans.net/2016/07/18/2016-british-grand-prix-team-radio-transcript/
## 30 https://www.racefans.net/2016/07/07/2016-austrian-grand-prix-team-radio-transcript/
## 31 https://www.racefans.net/2016/06/22/2016-european-grand-prix-team-radio-transcript/
## 32 https://www.racefans.net/2016/06/16/2016-canadian-grand-prix-team-radio-transcript/
## 33 https://www.racefans.net/2016/05/04/2016-russian-grand-prix-team-radio-transcript/
## 34 https://www.racefans.net/2016/04/22/2016-chinese-grand-prix-team-radio-transcript/
## 35 https://www.racefans.net/2016/04/06/2016-bahrain-grand-prix-team-radio-transcript/
## 36 https://www.racefans.net/2016/03/23/2016-australian-grand-prix-team-radio-transcript-2/
## 37 https://www.racefans.net/2015/12/01/2015-abu-dhabi-grand-prix-team-radio-transcript/
## 38 https://www.racefans.net/2015/11/17/2015-brazilian-grand-prix-team-radio-transcript/
## 39 https://www.racefans.net/2015/11/04/2015-mexican-grand-prix-team-radio-transcript/
## 40 https://www.racefans.net/2015/10/29/2015-united-states-grand-prix-team-radio-transcript/
## 41 https://www.racefans.net/2015/10/14/2015-russian-grand-prix-team-radio-transcript/
## 42 https://www.racefans.net/2015/09/30/2015-japanese-grand-prix-team-radio-transcript/
## 43 https://www.racefans.net/2015/09/23/2015-singapore-grand-prix-team-radio-transcript/
## 44 https://www.racefans.net/2015/09/09/2015-italian-grand-prix-team-radio-transcript/
## 45 https://www.racefans.net/2015/08/26/2015-belgian-grand-prix-team-radio-transcript/
## 46 https://www.racefans.net/2015/07/29/2015-hungarian-grand-prix-team-radio-transcript/
## 47 https://www.racefans.net/2015/07/08/2015-british-grand-prix-team-radio-transcript/
## 48 https://www.racefans.net/2015/06/25/2015-austrian-grand-prix-team-radio-transcript/
## 49 https://www.racefans.net/2015/06/10/2015-canadian-grand-prix-team-radio-transcript/
## 50 https://www.racefans.net/2015/05/27/2015-monaco-grand-prix-team-radio-transcript/
## 51 https://www.racefans.net/2015/05/13/2015-spanish-grand-prix-team-radio-transcript/
## 52 https://www.racefans.net/2015/04/24/2015-bahrain-grand-prix-team-radio-transcript/
## 53 https://www.racefans.net/2015/04/23/2015-chinese-grand-prix-team-radio-transcript/
## 54 https://www.racefans.net/2014/11/27/2014-abu-dhabi-grand-prix-team-radio-transcript/
## 55 https://www.racefans.net/2014/11/14/2014-brazilian-grand-prix-team-radio-transcript/
## 56 https://www.racefans.net/2014/11/05/2014-united-states-grand-prix-team-radio-transcript/
## 57 https://www.racefans.net/2014/10/15/2014-russian-grand-prix-team-radio-transcript/
## 58 https://www.racefans.net/2014/10/08/2014-japanese-grand-prix-team-radio-transcript/
## 59 https://www.racefans.net/2014/09/24/2014-singapore-grand-prix-team-radio-transcript/
## 60 https://www.racefans.net/2014/09/10/2014-italian-grand-prix-team-radio-transcript/
## 61 https://www.racefans.net/2014/08/31/2014-belgian-grand-prix-team-radio-transcript/
## 62 https://www.racefans.net/2014/07/30/2014-hungarian-grand-prix-team-radio-transcript-2/
## 63 https://www.racefans.net/2014/07/23/2014-german-grand-prix-team-radio-transcript/
## 64 https://www.racefans.net/2014/07/09/2014-british-grand-prix-team-radio-transcript/
## 65 https://www.racefans.net/2014/06/25/2014-austrian-grand-prix-team-radio-transcript/
## 66 https://www.racefans.net/2014/06/18/2014-canadian-grand-prix-team-radio-transcript/
## 67 https://www.racefans.net/2014/04/23/2014-chinese-grand-prix-team-radio-transcript/
## 68 https://www.racefans.net/2014/04/09/2014-bahrain-grand-prix-team-radio-transcript/
## 69 https://www.racefans.net/2014/04/02/2014-malaysian-grand-prix-team-radio-transcript/
## 70 https://www.racefans.net/2014/03/18/2014-australian-grand-prix-team-radio-transcript/
## 71 https://www.racefans.net/2013/11/26/2013-brazilian-grand-prix-team-radio-transcript/
## 72 https://www.racefans.net/2013/11/20/2013-united-states-grand-prix-team-radio-transcript/
## 73 https://www.racefans.net/2013/11/09/2013-abu-dhabi-grand-prix-team-radio-transcript/
## 74 https://www.racefans.net/2013/11/01/2013-indian-grand-prix-team-radio-transcript/
## 75 https://www.racefans.net/2013/10/17/2013-japanese-grand-prix-team-radio-transcript-2/
## 76 https://www.racefans.net/2013/10/09/2013-korean-grand-prix-team-radio-transcript/
## 77 https://www.racefans.net/2013/09/26/2013-singapore-grand-prix-team-radio-transcript/
## 78 https://www.racefans.net/2013/09/13/2013-italian-grand-prix-team-radio-transcript/
## 79 https://www.racefans.net/2013/08/29/2013-belgian-grand-prix-team-radio-transcript/
## 80 https://www.racefans.net/2013/08/02/2013-hungarian-grand-prix-team-radio-transcript/
## 81 https://www.racefans.net/2013/07/11/2013-german-grand-prix-team-radio-transcript/
## 82 https://www.racefans.net/2013/07/03/2013-british-grand-prix-team-radio-transcript/
## 83 https://www.racefans.net/2013/06/25/2013-canadian-grand-prix-team-radio-transcript/
## 84 https://www.racefans.net/2013/05/29/2013-monaco-grand-prix-team-radio/
Now, we can extract the radio transmissions from our links.
We’ll need to account for two different table formats, as the page
changed them slightly after 2015. We also have to filter out emphasized
(<em>) text, which is context given by the author and
not message content.
Season and Location are added from the header in two columns
Year and GP.
radio = data.frame(matrix(ncol = 5, nrow = 0))
cols = c('`Lap**`', 'Driver', 'Message', 'Year', 'GP')
colnames(radio) = cols
for (link in 1:84) {#1:84) {
remDr$navigate(links[link, 1])
Sys.sleep(sleep)
page = remDr$getPageSource()[[1]]
page = read_html(page)
table = html_nodes(page, "table")
if(link == 27) { dataframe = html_table(table, fill = TRUE)[[2]] }
else { dataframe = html_table(table, fill = TRUE)[[1]] }
# Format: Lap | Driver | Message
title = html_nodes(page, '.entry-title') %>% html_text
dataframe$Year = str_extract(title, "\\b\\w+")
dataframe$GP = unlist(str_extract_all(title, "(?<=\\d\\d\\d\\d )[\\w ]+(?= Grand)"))
# Format: Lap | To | From | Message | +Year | +GP
if (dataframe$Year[1] < 2016) {
for (row in 1:nrow(dataframe)) {
if (dataframe$To[row] %in% alldrivers) {
dataframe$Driver[row] = paste("To",dataframe$To[row], sep=" ")
} else {
dataframe$Driver[row] = paste("From",dataframe$From[row], sep=" ")
}
}
dataframe = select(dataframe, -c(To, From))
}
radio = rbind(radio, dataframe)
write.csv(radio, "radio.csv", row.names = F)
}
pradio = read.csv("radio.csv")
Now we’ve got our data sorted by year and Grand Prix in a dataframe.
Let’s add a column for Team and Driver name, since that’s not in our
dataframe yet. We’ll also add the column Dir, indicating
the direction of the message (T = to and F =
from the Driver).
# Extract dir and Name from Driver
pradio$Driver_Name = str_replace_all(pradio$Driver, "(To )*(From )*", "")
pradio$Dir = substring(pradio$Driver, 1,1)
pradio$Driver = pradio$Driver_Name
pradio = select(pradio, -Driver_Name)
# Add team by Year and Driver
driverstandings = bind_rows(
driverstandings2013, driverstandings2014, driverstandings2015, driverstandings2016, driverstandings2017
)
pradio = pradio %>%
left_join(driverstandings, by = c("Driver", "Year")) %>%
select(Driver, Dir, Team = Car, Message, Lap = Lap., GP, Year, DPos = Pos)
# Merge Team Names
pradio$Team = str_replace_all(pradio$Team, "Red Bull Racing TAG Heuer", "Red Bull")
pradio$Team = str_replace_all(pradio$Team, "Red Bull Racing Renault", "Red Bull")
pradio$Team = str_replace_all(pradio$Team, "McLaren Mercedes", "McLaren")
pradio$Team = str_replace_all(pradio$Team, "McLaren Honda", "McLaren")
pradio$Team = str_replace_all(pradio$Team, "Lotus Mercedes", "Lotus")
pradio$Team = str_replace_all(pradio$Team, "Lotus Renault", "Lotus")
pradio$Team = str_replace_all(pradio$Team, "Haas Ferrari", "Haas")
pradio$Team = str_replace_all(pradio$Team, "Force India Mercedes", "Force India")
pradio$Team = str_replace_all(pradio$Team, "STR Renault", "Toro Rosso")
pradio$Team = str_replace_all(pradio$Team, "STR Ferrari", "Toro Rosso")
pradio$Team = str_replace_all(pradio$Team, "Toro Rosso Ferrari", "Toro Rosso")
pradio$Team = str_replace_all(pradio$Team, "Williams Mercedes", "Williams")
pradio$Team = str_replace_all(pradio$Team, "Williams Renault", "Williams")
pradio$Team = str_replace_all(pradio$Team, "Caterham Renault", "Caterham")
pradio$Team = str_replace_all(pradio$Team, "Marussia Ferrari", "Marussia/Manor")
pradio$Team = str_replace_all(pradio$Team, "Marussia Cosworth", "Marussia/Manor")
pradio$Team = str_replace_all(pradio$Team, "MRT Mercedes", "Marussia/Manor")
pradio$Team = str_replace_all(pradio$Team, "Sauber Ferrari", "Sauber")
driverstandings$Car = str_replace_all(driverstandings$Car, "Red Bull Racing TAG Heuer", "Red Bull")
driverstandings$Car = str_replace_all(driverstandings$Car, "Red Bull Racing Renault", "Red Bull")
driverstandings$Car = str_replace_all(driverstandings$Car, "McLaren Mercedes", "McLaren")
driverstandings$Car = str_replace_all(driverstandings$Car, "McLaren Honda", "McLaren")
driverstandings$Car = str_replace_all(driverstandings$Car, "Lotus Mercedes", "Lotus")
driverstandings$Car = str_replace_all(driverstandings$Car, "Lotus Renault", "Lotus")
driverstandings$Car = str_replace_all(driverstandings$Car, "Haas Ferrari", "Haas")
driverstandings$Car = str_replace_all(driverstandings$Car, "Force India Mercedes", "Force India")
driverstandings$Car = str_replace_all(driverstandings$Car, "STR Renault", "Toro Rosso")
driverstandings$Car = str_replace_all(driverstandings$Car, "STR Ferrari", "Toro Rosso")
driverstandings$Car = str_replace_all(driverstandings$Car, "Toro Rosso Ferrari", "Toro Rosso")
driverstandings$Car = str_replace_all(driverstandings$Car, "Williams Mercedes", "Williams")
driverstandings$Car = str_replace_all(driverstandings$Car, "Williams Renault", "Williams")
driverstandings$Car = str_replace_all(driverstandings$Car, "Caterham Renault", "Caterham")
driverstandings$Car = str_replace_all(driverstandings$Car, "Marussia Ferrari", "Marussia/Manor")
driverstandings$Car = str_replace_all(driverstandings$Car, "Marussia Cosworth", "Marussia/Manor")
driverstandings$Car = str_replace_all(driverstandings$Car, "MRT Mercedes", "Marussia/Manor")
driverstandings$Car = str_replace_all(driverstandings$Car, "Sauber Ferrari", "Sauber")
teamColors = c(
"Mercedes" = "grey",
"Haas" = "black",
"McLaren" = "orange",
"Ferrari" = "red",
"Force India" = "pink",
"Renault" = "yellow",
"Williams" = "skyblue",
"Red Bull" = "darkblue",
"Toro Rosso" = "blue",
"Sauber" = "darkred",
"Marussia/Manor" = "#ff5a5e",
"Lotus" = "gold",
"Caterham" = "darkgreen"
)
Our data is formatted and finished! Let’s look at some samples:
kable(pradio[sample(nrow(pradio), 5), ])
| Driver | Dir | Team | Message | Lap | GP | Year | DPos | |
|---|---|---|---|---|---|---|---|---|
| 3325 | Max Verstappen | T | Red Bull | Well done. As expected. You did a really solid job. | VL | Australian | 2017 | 6 |
| 15545 | Felipe Massa | T | Ferrari | Understood: ‘Bent’. | 24 | Canadian | 2013 | 8 |
| 5081 | Daniel Ricciardo | T | Red Bull | Okay mate, it’s time to push now. | 25 | Bahrain | 2016 | 3 |
| 7767 | Carlos Sainz Jnr | F | Toro Rosso | Seven. | 25 | Austrian | 2015 | 15 |
| 15147 | Kimi Raikkonen | T | Lotus | We understood that tyres are not bad. | 47 | German | 2013 | 5 |
The Codes in the Lap column mean:
while a number equals the lap number this radio transmission was played (meaning sent, in most cases) during the race.
For ease of use, we’ll divide pradio into different
sizes based on the number of messages sent to the driver.
numMessages = table(unlist(pradio$Driver))
sortedNumMessages = sort(numMessages, decreasing = TRUE)
topNames = as.data.frame(sortedNumMessages[1:39])
# Top 39 Drivers
pradioTop = subset(pradio, Driver %in% topNames$Var1)
# Top 20 Drivers
pradioTop20 = subset(pradio, Driver %in% topNames$Var1[1:20])
# Top Drivers
pradioTop10 = subset(pradio, Driver %in% topNames$Var1[1:10])
The data is distributed as follows:
pradioTop10
10,301 (64,5%)
pradioTop20
14,158 (88,6%)
pradioTop
15,875 (99,4%)
Now we’ve got 15,962 Messages to play with.
For general context, here’s the driver’s championship points plotted over the years in our data:
ds10 = subset(driverstandings, Driver %in% topNames$Var1[1:10])
ggplot(ds10, aes(x = Year, y = PTS, group = Driver, color = Car, label = Driver)) +
geom_line(linewidth = 2, lineend = "round") +
geom_text(nudge_x = 0, vjust = -0.5, size = 3) +
labs(title = "Driver Points by Year", x = "Year", y = "Driver Points") +
scale_color_manual(values = teamColors) +
theme_minimal() +
theme(legend.position = "bottom")
Let’s add wordLength to our dataframe for some further
analysis.
pradio$wordLength = str_count(pradio$Message, "\\w+")
# Update Subsets
pradioTop = subset(pradio, Driver %in% topNames$Var1)
pradioTop20 = subset(pradio, Driver %in% topNames$Var1[1:20])
pradioTop10 = subset(pradio, Driver %in% topNames$Var1[1:10])
Descriptive statistics are not too descriptive in this case:
summary(pradio)
## Driver Dir Team Message
## Length:15962 Length:15962 Length:15962 Length:15962
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Lap GP Year DPos
## Length:15962 Length:15962 Min. :2013 Min. : 1.00
## Class :character Class :character 1st Qu.:2014 1st Qu.: 3.00
## Mode :character Mode :character Median :2015 Median : 7.00
## Mean :2015 Mean : 8.01
## 3rd Qu.:2016 3rd Qu.:12.00
## Max. :2017 Max. :24.00
## NA's :85
## wordLength
## Min. : 1.00
## 1st Qu.: 6.00
## Median : 11.00
## Mean : 14.36
## 3rd Qu.: 19.00
## Max. :150.00
## NA's :1
ggplot(topNames[1:10,], aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), vjust=1.6, color="white", size=5)+
labs(x="Driver", y = "Number of Messages", title = "Message Distribution (Top 10 Drivers)") +
scale_color_manual(teamColors) +
theme_minimal()
numMessages = table(unlist(pradio$Team))
sortedNumMessages = sort(numMessages, decreasing = TRUE)
topTeams = as.data.frame(sortedNumMessages)
ggplot(topTeams[1:10,], aes(x=Var1, y=Freq, fill=Var1)) +
geom_bar(stat="identity")+
geom_text(aes(label=Freq), vjust=1.6, color="white", size=5)+
labs(x="Team", y = "Number of Messages", title = "Message Distribution (Top 10 Teams)") +
scale_fill_manual(values=teamColors) +
theme_minimal() +
theme(legend.position = "none")
teamm = nrow(subset(pradio, Dir == "T"))
driverm = nrow(subset(pradio, Dir == "F"))
prop = c(teamm, driverm)
pie(prop, labels = c(paste("Team to Driver (", teamm, ")", sep=""), paste("Driver to Team (", driverm, ")", sep="")), border="white", col=c("steelblue", "darkgrey"))
pradio$Year = factor(pradio$Year)
words_year = pradio %>% group_by(Year) %>% summarize(wordLength)
ggplot(words_year, aes(x = Year, y = wordLength, fill = Year)) +
geom_boxplot(outlier.color="black", outlier.shape=16, outlier.size = 2) +
labs(title = "Average Message Length (Words)", x = "Year", y = "Average N. of Words per Message") +
scale_y_log10() +
theme_minimal()
Let’s see if message length is a factor for success:
message_len = pradio %>%
group_by(Year, DPos) %>%
summarize(meanLength = mean(wordLength))
ggplot(message_len, aes(x = DPos, y = meanLength, color = Year)) +
geom_point() +
labs(
title = "Avg. Message Length (Words) by Championship Position",
x = "Position (Driver Championship)",
y = "Mean message length (Words)"
) +
geom_smooth(se=FALSE) +
xlim(1,24) +
scale_x_reverse() +
theme_minimal() +
theme(legend.position = "bottom")
message_count = pradio %>%
group_by(Year, DPos) %>%
summarize(NumMessages = n())
ggplot(message_count, aes(x = DPos, y = NumMessages, color = Year)) +
geom_point() +
labs(title = "Avg. Number of Messages by Championship Position", x = "Position (Driver Championship)", y = "Number of Messages broadcast") +
geom_smooth(se=FALSE) +
xlim(1,24) +
scale_x_reverse() +
theme_minimal() +
theme(legend.position = "bottom")
Let’s create word clouds to see the most frequent words for each team:
team_data_frames = pradio %>%
group_split(Team)
team_freq = lapply(team_data_frames, function(team_df) {
corpus = Corpus(VectorSource(team_df$Message))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removeWords, stopwords("english"))
term_freq = TermDocumentMatrix(corpus)
term_freq = as.matrix(term_freq)
word_freq = rowSums(term_freq)
word_freq = sort(word_freq, decreasing = TRUE)
word_freq = data.frame(word = names(word_freq), freq = word_freq)
return(word_freq)
})
Since we have lap numbers, let’s see how the topics and words used changed throughout the race:
scopes = list(
c('PR', 'FL'), # Pre-Race
seq(1, 30), # Laps 1-30
seq(31, 78), # Laps >30
'VL' # Victory Lap
)
Drivers drive a Formation Lap to warm up their tyres, practise for the start and line up on the grid.
The Victory Lap or cooldown lap serves to return drivers to the pit entry, since they pass it on the way to the finish line. In the broadcast, it’s mostly used as an opportunity for Driver and Team to celebrate.
What do both parties focus on?
party_data_frames = pradio %>%
group_split(Dir)
party_freq = lapply(party_data_frames, function(df) {
corpus = Corpus(VectorSource(df$Message))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
term_freq = TermDocumentMatrix(corpus)
term_freq = as.matrix(term_freq)
word_freq = rowSums(term_freq)
word_freq = sort(word_freq, decreasing = TRUE)
word_freq = data.frame(word = names(word_freq), freq = word_freq)
return(word_freq)
})
wordcloud2(party_freq[[1]], color ="darkgrey")
wordcloud2(party_freq[[2]], color ="steelblue")
Let’s compare lexicons to see which will give better results with our data.
sentiments = get_sentiments("bing")
# Tokenize the text messages
pradio_tokens = pradio %>%
unnest_tokens(word, Message)
pradio_sentiment = pradio_tokens %>%
inner_join(sentiments, by = c(word = "word"))
kable(head(pradio))
| Driver | Dir | Team | Message | Lap | GP | Year | DPos | wordLength |
|---|---|---|---|---|---|---|---|---|
| Lewis Hamilton | F | Mercedes | It’s quite windy out there. | PR | Abu Dhabi | 2017 | 1 | 6 |
| Lewis Hamilton | T | Mercedes | To Hamilton: Still headwind turn two, it may have rotated compared to yesterday so more of a tailwind turn eight. Still predominantly the same direction, though. | PR | Abu Dhabi | 2017 | 1 | 26 |
| Romain Grosjean | F | Haas | Grosjean: Thanks for waiting, I’ve finished my drink. | PR | Abu Dhabi | 2017 | 13 | 9 |
| Romain Grosjean | T | Haas | No problem, that was a big drink. | PR | Abu Dhabi | 2017 | 13 | 7 |
| Lewis Hamilton | F | Mercedes | Is the temperature still dropping? | PR | Abu Dhabi | 2017 | 1 | 5 |
| Lewis Hamilton | T | Mercedes | Yeah it should be dropping. The sun’s just disappearing behind the grandstand. It should be down. | PR | Abu Dhabi | 2017 | 1 | 17 |
nrow(pradio_sentiment)
## [1] 14861
kable(head(pradio_sentiment))
| Driver | Dir | Team | Lap | GP | Year | DPos | wordLength | word | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| Romain Grosjean | T | Haas | PR | Abu Dhabi | 2017 | 13 | 7 | problem | negative |
| Fernando Alonso | T | McLaren | PR | Abu Dhabi | 2017 | 15 | 17 | soft | positive |
| Fernando Alonso | T | McLaren | PR | Abu Dhabi | 2017 | 15 | 17 | soft | positive |
| Lewis Hamilton | F | Mercedes | FL | Abu Dhabi | 2017 | 1 | 6 | fumes | negative |
| Lewis Hamilton | F | Mercedes | FL | Abu Dhabi | 2017 | 1 | 6 | pretty | positive |
| Lewis Hamilton | F | Mercedes | FL | Abu Dhabi | 2017 | 1 | 6 | strong | positive |
sentiments = get_sentiments("afinn")
pradio_sentiment = pradio_tokens %>%
inner_join(sentiments, by = c(word = "word"))
pradioTop10_sentiment = pradioTop10 %>%
unnest_tokens(word, Message) %>%
inner_join(sentiments, by = c(word = "word"))
nrow(pradio_sentiment)
## [1] 15491
kable(head(pradio_sentiment))
| Driver | Dir | Team | Lap | GP | Year | DPos | wordLength | word | value |
|---|---|---|---|---|---|---|---|---|---|
| Romain Grosjean | F | Haas | PR | Abu Dhabi | 2017 | 13 | 9 | thanks | 2 |
| Romain Grosjean | T | Haas | PR | Abu Dhabi | 2017 | 13 | 7 | no | -1 |
| Romain Grosjean | T | Haas | PR | Abu Dhabi | 2017 | 13 | 7 | problem | -2 |
| Romain Grosjean | T | Haas | PR | Abu Dhabi | 2017 | 13 | 7 | big | 1 |
| Lewis Hamilton | T | Mercedes | PR | Abu Dhabi | 2017 | 1 | 17 | yeah | 1 |
| Lewis Hamilton | F | Mercedes | FL | Abu Dhabi | 2017 | 1 | 6 | pretty | 1 |
AFINN performs slightly better and also gives us more
detail to work with. Let’s go with that!
ggplot(pradio_sentiment, aes(x = Team, y = value, fill = Team)) +
geom_boxplot(outlier.color="black", outlier.shape=16, outlier.size = 2) +
labs(title = "Sentiment per Team", x = "Team", y = "Sentiment Score (AFINN)") +
scale_fill_manual(values=teamColors) +
theme_minimal() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
ggplot(pradioTop10_sentiment, aes(x = Driver, y = value)) +
geom_boxplot(outlier.color="black", outlier.shape=16, outlier.size = 2) +
labs(title = "Sentiment of pradioTop10 Drivers", x = "Driver", y = "Sentiment Score (AFINN)") +
theme_minimal() +
theme(legend.position = "bottom") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
pradio_sentiment$Lap <- ifelse(pradio_sentiment$Lap == "PR", 0,
ifelse(pradio_sentiment$Lap == "FL", 0,
ifelse(pradio_sentiment$Lap == "VL", max(pradio_sentiment$Lap <80),
pradio_sentiment$Lap)))
pradio_sentiment$Lap <- as.integer(pradio_sentiment$Lap)
average_lap_sentiment <- pradio_sentiment %>%
group_by(Lap, GP) %>%
summarize(avg_sentiment = mean(value))
average_lap_sentiment %>%
ggplot(aes(x=Lap, y=avg_sentiment, group=GP, fill=GP)) +
geom_line(linewidth = .2, lineend = "round", color="black") +
geom_smooth(method="lm", linewidth = .3, lineend = "round", color="red") +
labs(title = "Average Sentiment during the Race per GP", x = "Lap", y = "Average Sentiment Score (AFINN)") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
legend.position="none",
panel.spacing = unit(0.1, "lines"),
strip.text.x = element_text(size = 8),
plot.title = element_text(size=14)
) +
facet_wrap(~GP, ncol=4)
average_team_sentiment <- pradio_sentiment %>%
filter(!is.na(Team)) %>%
group_by(Team, Year) %>%
summarize(avg_sentiment = mean(value))
ggplot(average_team_sentiment, aes(x = Year, y = avg_sentiment, group = Team, color = Team)) +
geom_line(linewidth = 2, lineend = "round") +
labs(title = "Average Sentiment per Team over the Years", x = "Year", y = "Average Sentiment Score (AFINN)") +
scale_color_manual(values = teamColors) +
theme_minimal() +
theme(legend.position = "bottom")
team_sentiment_years = average_team_sentiment %>%
left_join(pradio_sentiment, by = c("Team", "Year"))
ggplot(team_sentiment_years, aes(y = avg_sentiment, x = DPos, color = Team)) +
geom_point() +
geom_smooth(method = "lm", color="red", se=TRUE) +
labs(title = "Correlation between Average Sentiment and Driver Position", y = "Average Sentiment Score (AFINN)", x = "Driver Position") +
theme_minimal() +
scale_x_reverse() +
scale_color_manual(values = teamColors) +
theme(legend.position = "bottom")